EPrints Technical Mailing List Archive

Message: #00478



[EP-tech] Re: Garbage indexing some pdf


On 04/27/2012 02:02 PM, rchilliard@mun.ca wrote:
Hi,

Regarding your p.s.: we also noticed many records in our repository with badly indexed words (words joined by a non-breaking space character or similar). Below is a (very) quick and dirty patch for Tokenizer.pm that adds the most frequent separator characters found in our full text.

Index: Tokenizer.pm
===================================================================
--- Tokenizer.pm    (revision 323)
+++ Tokenizer.pm    (working copy)
@@ -259,8 +259,63 @@
     '.' => 1,     '/' => 1,     ':' => 1,     ';' => 1,
     '{' => 1,     '<' => 1,     '|' => 1,     '=' => 1,
     '}' => 1,     '>' => 1,     '~' => 1,     '?' => 1,
- chr(0xb4) => 1, chr(0x27)=>1, '{' => 1, '}' => 1 # Acute Accent (closing quote)
+    chr(0xb4) => 1, chr(0x27)=>1,   '{' => 1,       '}' => 1,
+chr(0x81) => 1,
+chr(0x83) => 1,
+chr(0x00a0) => 1,
+chr(0x0090) => 1,
+chr(0x0099)  => 1,
+chr(0x009c)  => 1,
+chr(0x009d) => 1,
+chr(0x02B9) => 1, # ca b9    MODIFIER LETTER PRIME
+chr(0x02BA) => 1, # ca ba    MODIFIER LETTER DOUBLE PRIME
+chr(0x02BB) => 1, # ca bb    MODIFIER LETTER TURNED COMMA
+chr(0x02BC) => 1, # ca bc       MODIFIER LETTER APOSTROPHE
+chr(0x02BD) => 1, # ca bd    MODIFIER LETTER REVERSED COMMA
+chr(0x02BE) => 1, # ca be    MODIFIER LETTER RIGHT HALF RING
+chr(0x02BF) => 1, # ca bf    MODIFIER LETTER LEFT HALF RING
+chr(0x2000) => 1, # e2 80 80    EN QUAD
+chr(0x2001) => 1, # e2 80 81    EM QUAD
+chr(0x2002) => 1, # e2 80 82    EN SPACE
+chr(0x2003) => 1, # e2 80 83    EM SPACE
+chr(0x2004) => 1, # e2 80 84    THREE-PER-EM SPACE
+chr(0x2005) => 1, # e2 80 85    FOUR-PER-EM SPACE
+chr(0x2006) => 1, # e2 80 86    SIX-PER-EM SPACE
+chr(0x2007) => 1, # e2 80 87    FIGURE SPACE
+chr(0x2008) => 1, # e2 80 88    PUNCTUATION SPACE
+chr(0x2009) => 1, # e2 80 89    THIN SPACE
+chr(0x200A) => 1, # e2 80 8a    HAIR SPACE
+chr(0x200B) => 1, # e2 80 8b    ZERO WIDTH SPACE
+chr(0x2024)  => 1, # e2 80 a4    ONE DOT LEADER
+chr(0x2025)  => 1, # e2 80 a5   TWO DOT LEADER
+chr(0x2026)  => 1, # e2 80 a6   HORIZONTAL ELLIPSIS
+chr(0x2027)  => 1, # e2 80 a7   HYPHENATION POINT
+chr(0x2028)  => 1, # e2 80 a8   LINE SEPARATOR
+chr(0x2029)  => 1, # e2 80 a9   PARAGRAPH SEPARATOR
+chr(0x2018) => 1,  # e2 80 98    LEFT SINGLE QUOTATION MARK
+chr(0x2019) => 1, # e2 80 99    RIGHT SINGLE QUOTATION MARK
+chr(0x201c) => 1, # e2 80 9c    LEFT DOUBLE QUOTATION MARK
+chr(0x201d) => 1,  # e2 80 9d    RIGHT DOUBLE QUOTATION MARK
+chr(0x2010) => 1,  # e2 80 90    HYPHEN
+chr(0x2011) => 1,  # e2 80 91    NON-BREAKING HYPHEN
+chr(0x2012) => 1,  # e2 80 92    FIGURE DASH
+chr(0x2013) => 1,  # e2 80 93    EN DASH
+chr(0x2014) => 1,  # e2 80 94    EM DASH
+chr(0x2015) => 1,  # e2 80 95    HORIZONTAL BAR
+chr(0xFB00) => 1,  #ef ac 80    LATIN SMALL LIGATURE FF
+chr(0xFB01) => 1,  #ef ac 81    LATIN SMALL LIGATURE FI
+chr(0xFB02) => 1,  #ef ac 82    LATIN SMALL LIGATURE FL
+chr(0xFB03) => 1,  #ef ac 83    LATIN SMALL LIGATURE FFI
+chr(0xFB04) => 1,  #ef ac 84    LATIN SMALL LIGATURE FFL
+chr(0xFB05) => 1,  #ef ac 85    LATIN SMALL LIGATURE LONG S T
+chr(0xFB06) => 1,  #ef ac 86    LATIN SMALL LIGATURE ST
+chr(0xFFF9 ) => 1,  #ef bf b9 INTERLINEAR ANNOTATION ANCHOR
+chr(0xFFFA ) => 1,  #ef bf ba INTERLINEAR ANNOTATION SEPARATOR
+chr(0xFFFB ) => 1,  #ef bf bb INTERLINEAR ANNOTATION TERMINATOR
+chr(0xFFFC ) => 1,  #ef bf bc OBJECT REPLACEMENT CHARACTER
+chr(0xFFFD ) => 1  #ef bf bd REPLACEMENT CHARACTER
 };
+
 $EPrints::Index::FREETEXT_SEPERATOR_REGEXP = quotemeta(join "", keys %$EPrints::Index::FREETEXT_SEPERATOR_CHARS);
 $EPrints::Index::FREETEXT_SEPERATOR_REGEXP = qr/[$EPrints::Index::FREETEXT_SEPERATOR_REGEXP\x00-\x20]/;
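
To check the effect of the extra characters outside EPrints, a minimal standalone sketch of the same construction might look like this. The names %sep_chars, $sep_re and split_terms and the sample string are assumptions for illustration only, and just a handful of the characters from the patch are listed.

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for FREETEXT_SEPERATOR_CHARS.
my %sep_chars = (
    '.'         => 1,
    ','         => 1,
    chr(0x00A0) => 1,    # NO-BREAK SPACE
    chr(0x2009) => 1,    # THIN SPACE
    chr(0x2013) => 1,    # EN DASH
    chr(0x2019) => 1,    # RIGHT SINGLE QUOTATION MARK
);

# Same construction as in the patch context: escape every key, then wrap
# the result in a character class together with the control/space range.
my $class  = quotemeta( join "", keys %sep_chars );
my $sep_re = qr/[$class\x00-\x20]/;

# Split a string into terms, dropping the empty pieces that appear when
# the text starts with a separator.
sub split_terms {
    my ($text) = @_;
    return grep { length } split /$sep_re+/, $text;
}

my $sample = "open\x{00A0}access repositories\x{2013}indexing";
my @terms  = split_terms($sample);
print scalar(@terms), " terms\n";    # prints "4 terms"
```

Without the NO-BREAK SPACE and EN DASH entries the same sample would yield two terms, which is the clustering effect described in the p.s.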

Best regards,
Paolo

Hi Paolo,

    I took a quick look at the sample you provided, and it looks like the character mapping is missing for the body text. If you export the PDF to text from Acrobat (or an equivalent) and open the result in a hex editor, every character is mapped to ASCII 0x2e. With a vanilla run of pdftotext (e.g. pdftotext test.pdf test.txt), the characters are mapped to ASCII 0x20. With UTF-8 output from pdftotext (the command run by the indexer is roughly pdftotext -enc UTF-8 test.pdf test_utf.txt), you get the byte sequence "ef 80 bd" for each character.
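
The byte sequence "ef 80 bd" is the UTF-8 encoding of U+F03D, which lies in the Private Use Area (U+E000 to U+F8FF), so it carries no standard meaning. As a rough sketch, extracted text could be screened for this symptom as follows. The function name private_use_ratio and the idea of using the ratio as a heuristic are assumptions for illustration, not something the EPrints indexer does.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return the fraction of characters in a UTF-8 byte string that fall in
# the BMP Private Use Area (U+E000..U+F8FF).
sub private_use_ratio {
    my ($bytes) = @_;
    my $text  = decode( 'UTF-8', $bytes );
    my $total = length $text;
    return 0 unless $total;
    my $private = () = $text =~ /[\x{E000}-\x{F8FF}]/g;
    return $private / $total;
}

# The byte sequence reported above, repeated as it would be for a run of
# unmapped glyphs.
my $bytes = "\xEF\x80\xBD" x 4;
my $char  = decode( 'UTF-8', "\xEF\x80\xBD" );
printf "U+%04X\n", ord $char;                  # prints "U+F03D"
printf "%.2f\n", private_use_ratio($bytes);    # prints "1.00"
```

A ratio close to 1 for the output of pdftotext -enc UTF-8 would suggest that the document has no usable character mapping and that indexing its full text will mostly produce garbage.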

    It may be possible to reconstruct the mapping information after the fact, but I'm not aware of a mechanism to perform that operation. It also appears that this may have been done on purpose when the PDF was generated: most tellingly, the licensing / attribution information at the end of the file is mapped properly.

p.s. Thank you for the note / query about testing the indexed word lengths. It alerted us to a potential issue in our repository (and possibly others'?) whereby multiple words are indexed together as clusters because they are not tokenized on non-breaking space ('&nbsp;') characters.
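
One way to see which characters cause such clusters is to split the text on the ASCII control/space range only and then count the non-letter, non-digit characters that remain inside the tokens. The sketch below illustrates that idea. The function name suspect_chars and the sample text are assumptions, and it is not taken from EPrints.

```perl
use strict;
use warnings;

# Count characters that are neither letters, digits nor combining marks
# but still sit inside tokens after splitting on \x00-\x20 only.
sub suspect_chars {
    my ($text) = @_;
    my %count;
    for my $token ( split /[\x00-\x20]+/, $text ) {
        $count{$1}++ while $token =~ /([^\p{L}\p{N}\p{M}])/g;
    }
    return \%count;
}

my $text  = "multiple\x{00A0}words indexed\x{00A0}in\x{2009}clusters";
my $found = suspect_chars($text);

# List the suspects by code point, most frequent first.
for my $c ( sort { $found->{$b} <=> $found->{$a} || $a cmp $b } keys %$found ) {
    printf "U+%04X %d\n", ord $c, $found->{$c};
}
```

Ordinary punctuation will show up in the counts as well. Characters that are not yet in the separator list are the ones worth a closer look.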


--
Ing. Paolo Tealdi         Area IT - Politecnico Torino
Telefono/Phone : +39-011-0906714 , FAX : +39-011-0906799
Indirizzo/Address : C.so Duca degli Abruzzi,  24 - 10129 Torino - ITALY
Skype : tealdi.paolo
Please consider your environmental responsibility before printing this e-mail